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Transistors,  performance,  power .... 


Density  scales  at  traditional 
rate 

•  Caches  get  larger  and 
contribute  significantly  to 
transistor  count 

Clock  speed  and  power  level 
off 

•  Single  thread  performance  is 
power  limited 

•  Need  more  efficient  cooling 
solutions 

System performance will pick up with parallelism

•  Many  low  power  cores  will 
deliver  same  throughput 

•  Application  dependent 
solution 


Source:  Kunle  Olukotun,  Lance  Hammond,  Herb  Sutter,  and  Burton  Smith 
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Where  does  the  power  go? 

Karthick  Rajamani,  Boulder  TVC 
■ Power is not only a processor problem; it penetrates the whole system
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I/O  and  Disk  Storage 
Cooling 

Power  Subsystem 


The  power  problem 
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General  considerations 


Circuit: energy $E \sim C V^2$, delay $\tau \sim C V / I$

Chip: power density

Power: $P_{ac} \sim f\, C_{eff}\, V^2$

Frequency: $f \sim a\,(V - V_0)$

Voltage  scaling  comes  at  the  cost  of  performance 
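The trade-off can be made concrete with a small numerical sketch. It uses only the two proportionalities from this slide, $P_{ac} \sim f\, C_{eff} V^2$ and $f \sim a (V - V_0)$; the reference supply, the value of $V_0$, and the assumption of a perfectly parallel workload are illustrative choices, not values from the talk.

```python
def relative_frequency(v, v0, v_ref):
    """Clock frequency relative to the reference supply, from f ~ a(V - V0)."""
    return (v - v0) / (v_ref - v0)


def relative_power(v, v0, v_ref):
    """Dynamic power per core relative to the reference, from P ~ f * C_eff * V^2."""
    return relative_frequency(v, v0, v_ref) * (v / v_ref) ** 2


# Hypothetical operating points (illustrative values, not from the slides).
V_REF, V0 = 1.0, 0.3
for v in (1.0, 0.8, 0.65):
    f_rel = relative_frequency(v, V0, V_REF)
    p_rel = relative_power(v, V0, V_REF)
    # Cores needed to recover the reference throughput, assuming the
    # workload parallelizes perfectly (an idealization).
    cores = 1.0 / f_rel
    print(f"V={v:.2f} V  f={f_rel:.2f}  P/core={p_rel:.2f}  "
          f"cores={cores:.2f}  total P={cores * p_rel:.2f}")
```

Under these assumptions the total power at constant throughput falls as $(V/V_{ref})^2$, which is the argument for many low power cores; real workloads rarely parallelize this cleanly.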



March 2009, Haensch

©  2005  IBM  Corporation 


IBM  Research 


General  considerations 
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Four  core  processors 
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Voltage  scaling  comes  at  the  cost  of  performance 
Silver  bullet:  Low  V,  low  C,  high  I 
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Scaling  dilemma 
$I_{off} \sim 100$ nA/μm

$S > 60$ mV/dec

$I_{Vth} \sim 300$ μm/L nA/μm

$C_{inv} \sim 35$ fF/μm², EOT ~ 0.8 nm

$I_{on} \sim 1.5$ mA/μm

$V_{sat} \sim 0.25$ V

Simple  voltage  scaling  is 
challenging 
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Device  choices 

$I_{on}$ @ given $I_{off}$
$I_{eff}$ @ given $I_{off}$

Short channel effects will
degrade $I_{eff}$ faster than $I_{on}$

- $I_{on} \sim (V - V_{tsat})^{\alpha}$

- $I_{on} - I_{high} \sim \mathrm{DIBL} \cdot V/2$

- $I_{low} \sim (V/2 - V_{tsat})^{\alpha}$
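A minimal sketch of why DIBL hurts the effective current more than the on-current. It assumes the common definition $I_{eff} = (I_{high} + I_{low})/2$, which is not spelled out on the slide, and models DIBL as a threshold shift of DIBL·V/2 when the drain sits at V/2. The function name, the prefactor `k`, and all numerical values are hypothetical.

```python
def drive_currents(v, v_tsat, dibl, alpha, k=1.0):
    """Return (I_on, I_high, I_low, I_eff) from an alpha-power-law model.

    v      : supply voltage [V]
    v_tsat : saturation threshold voltage, defined at V_ds = v [V]
    dibl   : DIBL coefficient [V/V]
    alpha  : alpha-power-law exponent
    k      : current prefactor (arbitrary units, hypothetical)
    """
    def overdrive(x):
        # Clamp at zero so a negative overdrive does not produce a complex value.
        return max(x, 0.0) ** alpha

    i_on = k * overdrive(v - v_tsat)                    # V_gs = V,   V_ds = V
    i_high = k * overdrive(v - v_tsat - dibl * v / 2)   # V_gs = V,   V_ds = V/2
    i_low = k * overdrive(v / 2 - v_tsat)               # V_gs = V/2, V_ds = V
    i_eff = (i_high + i_low) / 2                        # assumed definition
    return i_on, i_high, i_low, i_eff


# Same I_on (same V_tsat), two illustrative DIBL values.
for dibl in (0.05, 0.15):
    i_on, i_high, i_low, i_eff = drive_currents(1.0, 0.25, dibl, alpha=1.3)
    print(f"DIBL={dibl:.2f} V/V  I_on={i_on:.3f}  I_eff={i_eff:.3f}  "
          f"I_eff/I_on={i_eff / i_on:.3f}")
```

In this model $I_{on}$ is unchanged by DIBL at fixed $V_{tsat}$, while $I_{high}$ and therefore $I_{eff}$ drop as DIBL grows.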
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Si  Device  Evolution  (I) 

High μ options: liner stress,
eSiGe, SMT, HOT, SSDOI, ...
Gate  stack:  high-k,  Metal  gate 
Junction  engineering 
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Body  control  SCE  will  enable  length  scaling 
without  aggressive  dielectric 


FinFET/TriGate 


FDSOI 
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Si  Device  Evolution  (II) 


High μ options: liner stress,
eSiGe, SMT, HOT, SSDOI, ...
Gate  stack:  high-k,  Metal  gate 
Junction  engineering 
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Body  control  SCE  will  enable  length  scaling 
without  aggressive  dielectric 
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Optimal  technology  -  DIBL 
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Low  DIBL  is  more  advantageous  for  power  optimized 
performance ...
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Optimal  technology  -  supply  voltage 
... because it allows reduction of supply voltage for maximum performance
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Optimal  technology  -  gate  length 
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Superior  electrostatics  allows  shortest  gate  length  for 
FinFET 


Gate lengths are longer than expected, with no significant
difference between low power and high performance
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Optimal  technology  -  density 
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Only  moderate  gains  in  single  core 
performance  for  low  power 
technology 
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Power  delivery 


[Figure: power delivery diagram. Labels: high voltage, power loss, $V_{DD}$, voltage regulator, off-chip buck converter, microprocessor chip (core, cache).]


Bringing high voltage as close to the chip as possible would reduce
power loss
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The  case  for  high  voltage  power  delivery  to  ICs 

High voltage power delivery system: put the voltage down-converter close to the point of use

•  minimizes power loss

•  improves supply stability


Power loss

$\frac{P_{loss}}{P} = \frac{I^2 R}{P} = \frac{(P/V)^2 R}{P} = \frac{P R}{V^2}$


Supply stability

$\frac{L P}{V^2}$
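The power loss relation can be checked numerically. The sketch below evaluates $P_{loss}/P = P R / V^2$ for two delivery voltages; the load power and path resistance are hypothetical values chosen only to show the scaling.

```python
def relative_power_loss(p_watts, v_volts, r_ohms):
    """Fraction of the delivered power lost in the delivery path: P * R / V^2."""
    return p_watts * r_ohms / v_volts ** 2


# Hypothetical numbers: a 100 W load fed through 0.5 milliohm of path resistance.
P_LOAD = 100.0
R_PATH = 0.5e-3

for v in (1.0, 12.0):
    loss = relative_power_loss(P_LOAD, v, R_PATH)
    print(f"delivery at {v:4.1f} V -> {100 * loss:.3f} % of load power lost")
```

Because the loss scales as $1/V^2$, delivering at 12 V instead of 1 V cuts the resistive loss by a factor of 144 for the same path resistance.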
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The  case  for  high  voltage  power  delivery  to  ICs 


Microprocessor  Chip 


High 

Voltage 


Need  efficient  on-chip  converters  that  are  compatible  with 
base  CMOS  technology 
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^  Operation 


Homogeneous multi-processing
(e.g. 64 identical processors)

Critical path = 5Δt
Total energy = 68


Heterogeneous multi-processing
(e.g. 1 HP + 63 LP processors)

Critical path = 5Δt
Total energy = 15
Depending on load distribution, heterogeneous multi-processing
can achieve dramatic power reduction
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The  case  for  high  voltage  power  delivery  to  ICs 
Due to natural process variations (e.g. $L_{gate}$), supply voltages
need to be regulated for individual cores for
performance matching
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The  case  for  high  voltage  power  delivery  to  ICs 
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Obstacles  to  low-voltage  operation 

■ Increased gate delays → circuit design challenge

■ Increased sensitivity to process variability → device design challenge


Variability  needs  to  be  minimized  for  low  voltage  operation. 
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Noise 
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•  Current  is  a  super-linear  function  of  VDD 
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Low  voltage  circuit  operation  -  SRAM 
Read and write operations require a delicate balance of
cell devices, and threshold mismatch will impact the
operation window, in particular at low voltages

Device choices can significantly reduce the operating
voltage for a 6T SRAM cell
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Low  voltage  circuit  operation  -  SRAM 
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8T  SRAM  cell  decouples  read  and  write  and  opens  up 
operating  window  for  the  cell 
Devices  can  be  designed  at  minimum  dimensions, 
which  will  benefit  cell  size 
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Off-chip  bandwidth  vs  on-chip  cache 


What  is  the  real  relationship? 
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Off-chip  bandwidth  vs  on-chip  cache 


Architecture  solution  depends  on  application 


Threads  (T) 


Emma’s  Law: 

$2^{\alpha+\beta}\, T = (2^{\alpha} B) \times (8^{\beta} C)$
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Low  voltage  signaling 


Low voltage (e.g. 0.25 V) driver: power consumption on the final
driver is reduced by a factor of ~ $(0.925/0.25)^2 \approx 13.7$

•  Full  Vdd  on  the  gates  of  the  nMOS  drivers  for  high  speed 

Gated  diode  receiver 

•  Recover  data  back  to  full  Vdd  swing 
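The factor quoted for the final driver follows from $C V^2$ scaling of switching power. A quick check, assuming the load capacitance and switching rate are the same for both swings (an assumption not stated on the slide):

```python
V_DD = 0.925     # full supply swing [V], value from the slide
V_SWING = 0.25   # low-voltage driver swing [V], value from the slide


def swing_power_ratio(v_full, v_low):
    """Ratio of switching power at full swing to that at reduced swing (P ~ C V^2)."""
    return (v_full / v_low) ** 2


ratio = swing_power_ratio(V_DD, V_SWING)
print(f"power reduction factor: {ratio:.2f}")
print(f"remaining driver power: {100 / ratio:.1f} % of full swing")
```

The saving applies to the final driver only; the gated diode receiver that restores the full swing has its own power cost, which this sketch ignores.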


Low  voltage 
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Three  Dimensional  Silicon  Integration 


3D  Si  Integration 
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Low  voltage  scaling 
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Engineering 




Beating  the  sub-threshold  slope  limit 
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Tunnel  FETs  show  strong  voltage 
dependence  of  sub-threshold  slope 
On-current not yet on par with
conventional high performance FETs at
comparable $V_{gs}$ voltages
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Beating  the  sub-threshold  slope  limit 
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Supply  voltage  for  max  performance 
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For  high  power  parts  there  is  little  advantage  for  supply 
voltage  reduction  due  to  steep  sub  threshold  devices 

For low power parts there is a significant power supply
reduction potential if drive current equivalent to
a conventional FET can be achieved
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SS  Engineering  -  22nm  node 
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Performance  SS  slope  trade-off  window  small  for  high 
power  case 

For the low power space there is a large trade-off window
of performance versus SS-slope
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On-current  vs  SS-slope,  for  constant  performance 


High  power/  performance:  can  only  tolerate  30%  drive 
current  degradation 


Low power / performance: can tolerate 10x degradation
of drive current
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Adiabatic  charging 

How much energy must be dissipated to charge a capacitor?

Abrupt method: $E = \tfrac{1}{2} C V^2$

Quasi-static charging (aka 'adiabatic'), with ramp time $T$: $E = \frac{RC}{T}\, C V^2$

This assumes $T \gg RC$.
(There's an extra factor of $\pi^2/8$ for a sinusoidal ramp.)

Quasi-static charging + superconductivity: charging through a superconductor,
which behaves as an inductor and resistor in parallel:
$E \sim \frac{RC\,(L/R)^2}{T^3}\, C V^2$

This assumes $T \gg RC$ and $T \gg L/R$.

(To implement logic in this system would require superconducting FETs.
Such FETs are possible, but there have been very few experimental results.)
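A short numerical comparison of abrupt and quasi-static charging. It uses the textbook results $E = \tfrac{1}{2} C V^2$ for abrupt charging and $E = (RC/T)\, C V^2$ for a linear ramp of duration $T \gg RC$, with the $\pi^2/8$ factor for a sinusoidal ramp. The component values are hypothetical.

```python
import math


def abrupt_energy(c, v):
    """Energy dissipated when a capacitor is charged abruptly to v: C V^2 / 2."""
    return 0.5 * c * v ** 2


def adiabatic_energy(r, c, v, t_ramp, sinusoidal=False):
    """Energy dissipated for a slow ramp of duration t_ramp (valid for t_ramp >> R*C)."""
    energy = (r * c / t_ramp) * c * v ** 2
    if sinusoidal:
        energy *= math.pi ** 2 / 8  # extra factor for a sinusoidal ramp
    return energy


# Hypothetical node: 1 kOhm switch resistance, 1 fF load, 1 V swing (RC = 1 ps).
R, C, V = 1e3, 1e-15, 1.0
e_abrupt = abrupt_energy(C, V)
for t_ramp in (10e-12, 100e-12, 1e-9):
    e_ad = adiabatic_energy(R, C, V, t_ramp)
    print(f"T = {t_ramp * 1e12:6.0f} ps  "
          f"E_adiabatic / E_abrupt = {e_ad / e_abrupt:.4f}")
```

The ratio is $2RC/T$, so the energy saving is bought directly with a slower ramp.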
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Adiabatic  computing 

•  All logic transitions must be directly
driven by a clock waveform passing
through FETs. Output is cycled
back to input.

•  Transitions  cannot  ripple 
through  statically  powered  gates 
as  in  conventional  logic. 

•  The  ramp  rate  of  the  clock 
waveforms  must  be  low  to  save 
energy. 

•  Try  never  to  turn  on  an  FET  while 
there  is  a  voltage  difference 
between  source  and  drain,  since 
this  would  result  in  dissipation. 

•  Requires  multiple  clock 
waveforms 
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Adiabatic  computing 

•  If A = '1' and B = '0', or A = '0' and
B = '1': charge oscillates between
input and output.

•  In each cycle of the resonator
(clock), the logic signal is
created and then removed from
the output node capacitor.

•  If A = B = '0' or A = B = '1': output
decoupled from input, and output
kept to ground

•  Clock dissipates energy per
half cycle
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ON-state  Conductivity 
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Adiabatic  CMOS  penalty  factors  relative  to  conventional 
CMOS 


Penalty factor             #FETs   CPI    CV² energy   Comment
Dual rail                  2x      -      2x           Twice as many nodes
Logic density              3/4x    -      3/4x         Dual rail => no inverters
Uncompute                  1.5x    1.5x   2x           Mixed retractile and pipeline
Generate control signals   4/3x    -      1.5x
Clock supply               1x      7x     1.5x         No ripple: drive every transition
V_DD > 2 V_T               1x      -      4x           ~double supply voltage
TOTALS                     3x      10x    27x


Incomplete  reversibility  also  sets  a  lower  limit  on  energy  savings. 
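
The totals in the table are the products of the individual penalty factors, which can be verified with a few lines. The sketch treats each "-" entry as a factor of 1 (an assumption about the table's notation) and uses exact fractions to avoid rounding.

```python
from fractions import Fraction as F
from math import prod

# (#FETs, CPI, CV^2 energy) penalty per row; "-" in the table is taken as 1x.
PENALTIES = {
    "Dual rail":                (F(2),    F(1),    F(2)),
    "Logic density":            (F(3, 4), F(1),    F(3, 4)),
    "Uncompute":                (F(3, 2), F(3, 2), F(2)),
    "Generate control signals": (F(4, 3), F(1),    F(3, 2)),
    "Clock supply":             (F(1),    F(7),    F(3, 2)),
    "V_DD > 2 V_T":             (F(1),    F(1),    F(4)),
}

# Multiply down each column.
totals = [prod(column) for column in zip(*PENALTIES.values())]

for name, total in zip(("#FETs", "CPI", "CV^2 energy"), totals):
    print(f"{name:12s} {float(total):5.1f}x")
```

The FET and energy columns multiply out to exactly 3x and 27x; the CPI column gives 10.5x, which the table rounds to 10x.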
Comparing adiabatic to CMOS


Log  (relative  energy/operation) 
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Summary 

■ Classical CMOS still has life left

•  Fully  depleted  devices  will  offer  additional  knobs  for  gate  length  scaling 

•  Careful analysis of power delivery and I/Os shows opportunities to reduce
power losses in the system

•  Low voltage operation is the biggest knob for power reduction; however,
massive parallelism is needed to get performance back, and a low voltage
cache solution is needed

■ Sub-threshold engineering

•  Steep  sub-threshold  devices  have  their  place  in  the  low  power  applications 

•  Need to push devices to obtain at least 10% of the on-current of a conventional device

■  Adiabatic  computing 

•  Concepts  are  there,  not  clear  where  application  space  is 

•  This  needs  further  research 
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